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Abstract 

Background: The rapid advancements in the field of genome sequencing are aiding our understanding on many 
biological systems. In the last five years, computational biologists and bioinformatics specialists have come up with 
newer, better and more efficient tools towards the discovery, analysis and interpretation of different genomic 
variants from high-throughput sequencing data. Availability of reliable simulated dataset is essential and is the first 
step towards testing any newly developed analytical tools for variant discovery. Although there are tools currently 
available that can simulate variants, none present the possibility of simulating all the three major types of variations 
(Single Nucleotide Polymorphisms, Insertions and Deletions and Copy Number Variations) and can generate reads 
taking a realistic error-model into consideration. Therefore, an efficient simulator and read generator is needed that 
can simulate variants taking the error rates of true biological samples into consideration. 

Results: We report SlnC (Snp, Indel and Cnv) an open-source variant simulator and read generator capable of 
simulating all the three common types of biological variants taking into account a distribution of base quality score 
from a most commonly used next-generation sequencing instrument from lllumina. SlnC is capable of generating 
single- and paired-end reads with user-defined insert size and with high efficiency compared to the other existing 
tools. SlnC, due to its multi-threaded capability during read generation, has a low time footprint. SlnC is currently 
optimised to work in limited infrastructure setup and can efficiently exploit the commonly used quad-core desktop 
architecture to simulate short sequence reads with deep coverage for large genomes. 

Conclusions: We have come up with a user-friendly multi-variant simulator and read-generator tools called SlnC. 
SlnC can be downloaded from http://sourceforge.net/projects/sincsimulator. 



Background 

The rapid advancements in the field of genome se- 
quencing is aiding our understanding of genome organ- 
isation in many biological systems [1-3]. These tools are 
intended to analyse high throughput next-generation 
sequence (NGS) data and present biologically relevant 
interpretations. Given the high throughput nature of 
present day genomics, heuristic algorithms are implica- 
ted to identify or predict genome variations as small as 
single base nucleotide substitutions (SNVs) to insertion- 
deletion events (indels) and copy number variations 
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(CNVs). Hence, it is imperative for developers of NGS 
data analysis pipelines to establish the limits of their pre- 
dictions based on simulated data as in current practice. In 
the last five years, computational biologists and bioinfor- 
matics specialists have developed new algorithms for dif- 
ferent types of variant calling, have implemented existing 
algorithms for short- read mapping to reference genomes 
and/or optimized pipelines to perform a specific type of 
primary and secondary analysis [4-19]. SNVs, indels and 
CNVs are the most common types of biological variations 
in the genome. The tools to detect these variants have the 
common objective of finding novel variations with low fre- 
quency of false positives, rediscovering known variations 
in the genome of interest and facilitate subsequent gen- 
ome visualization and interpretation. Hence, availability of 
reliable and realistic simulated dataset bearing the three 
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major types of genomic variations (SNVs, indels and 
CNVs) is critical to test the operational limitations of 
newly developed or existing tools. This approach allows 
computational biologists to generate simulated datasets 
with biological meaning and sensitive to systematic error 
inherent to different sequencing technology platforms. 

Although, next-generation sequencing (NGS) instru- 
ments generate reads of various lengths and with varying 
error profiles, the most popular source of data remains 
sequencing instruments from Illumina, which employs a 
sequencing-by-synthesis [20] chemistry to generate short- 
reads. Keeping this in mind, we have developed an effi- 
cient, fast simulator and a read generator that mimics 
sequencing quality generated by Illumina platform. Hence, 
SInC uses a realistic error model based on base quality 
values of reads generated by the most prevalent sequen- 
cing platform, hence catering to the larger interest group. 
Although we have used the Illumina-derived base quality 
values, it can easily be adopted for any other sequencing 
platform by supplying an instrument-specific error profile. 

Currently available tools can either generate platform- 
specific, error-profile based reads or simulate reads across 
platforms [21-28]. It is also in our interest of disambigu- 
ation to classify the existing simulators into two major 
classes based on their functionality. First, the stand-alone 
read generators (RG) like Metasim [28], Flowsim [22], 
454Sim [24], Pbsim [21], GenFrag [29] and ART [25] 
among others with functionality limited to read genera- 
tion. The second class of simulators (SRG) include pIRS 
[23], GEMsim [26], dwgSIM [30] (based on wgsim of sam- 
tools), which have the option of simulating genomic varia- 
tions coupled with read generation functionality. Each of 
the above mentioned tools, although has its own set of 
advantages, suffers from either having a simplistic error 
model (in the case of GenFrag), errors that does not 
model real data (in the case of dwgSIM), does not assign 
quality values to reads (in the case of Metasim), does not 
simulate Illumina reads (in the case of Flowsim) or does 
not simulate multiple types of variations (in the case 
of pIRS and GEMsim). Interestingly, none of the exis- 
ting SRG simulators present the option to simulate 
CNVs. Hence, we have developed and implemented a 
C-program, SInC, to enable simulation of all the three 
major types of genomic variations, SNVs, indels and 
CNVs, coupled with a multi-threaded, error-profile based 
read generator. SInC has obvious advantages over the 
popular SRG simulators as dwgSIM simulates reads with 
identical dummy base quality values relieving the data of 
any base-quality related effects, pIRS cannot simulate 
CNVs and GEMsim simulates only SNVs. SInC models er- 
rors based on real data from Illumina instruments as in 
pIRS and additionally presents fine tuned options to repli- 
cate biologically meaningful variant simulations including 
CNVs. The multi-threaded algorithm in SInC for read 



generation provides substantial advantage in run time and 
allows for seamless simulation of high coverage data in a 
desktop environment. 

Here we present an evaluation of SInC using com- 
monly used SNV, indel and CNV detection tools. The 
speed, accuracy and efficiency was compared against other 
popular simulators and read generators. 

Implementation 

SInC performs two jobs; first it simulates variants (simu- 
lator) and then it generates reads (read generator). SInC 
simulator consists of three independent modules (one 
each for SNV, indel and CNV) that can either be exe- 
cuted independently in a mutually exclusive manner or 
in any combination. 

SNV simulation 

The exact frequency of SNPs in the human genome has 
not yet been determined accurately. Based on inferences 
from 629 complete genomes representing several human 
populations in the 1000 genome data, the current range 
of frequency of SNV lies between one per 300 to 1000 
bases [31]. For this purpose, we have assumed that the 
substitution events in human genome are independent 
and random. SInC simulator accepts a user defined per- 
centage value to simulate SNVs. The algorithm identifies 
this percentage value as the fraction of genome to esti- 
mate the number of SNVs and simulates SNVs across 
the genome. To maintain positional identities of these 
SNVs with respect to their frequency, that are normally 
distributed over the sequenced genome, the mean dis- 
tance of separation (DAvg) between SNVs is calculated 
(see Additional file 1 and Additional file 2). 

This ensures that the simulated SNVs are well dis- 
tributed over the genome. A positional filter is applied 
to remove the outlier SNVs, which are less than 15 bases 
apart. SInC simulator neglects SNVs simulated in the N- 
regions of the genome (where there is no A, T, G or C). 
Then the algorithm applies a user-defined transition to 
transversion (Ti/Tv) metric to maintain the biological 
significance of the SNVs across the genome. A Ti/Tv ra- 
tio of 2.1 was maintained across the population of simu- 
lated SNVs with 20% inherent heterozygosity to simulate 
human genome data as previously reported [32]. The 
flow chart illustrated in Figure 1A depicts the algorithm 
for simulating SNVs. 

Indel simulation 

Insertion and deletion (indel) events have a wide range 
of size-based variability. SInC simulator simulates short, 
medium and large indels in the range of 1-10 bp 11- 
20 bp and 21-100 bp respectively in concordance with 
earlier studies [33]. The ratio of incidence of insertions 
to deletions and heterozygous to homozygous indels in 
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Figure 1 Flowcharts indicate the algorithm implemented in simulation of A) SNP 7 B) lndel 7 C) CNV and subsequent D) read 
generation process. 
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human genome is set to 1:1 based on previous obser- 
vations [33]. The flow chart in Figure IB depicts the 
algorithm for simulating indels. 

The algorithm first randomly generates the position 
for indels and then uses a filter to replace any indel 
within the region of the N-region of the genome (no A, 
T, G or C assigned) with one in the sequenced region. 
To remove duplicates, the simulated indels are coordin- 
ate sorted and only the unique locations are retained. 
Usually, a redundancy of 2-5% is filtered out post co- 
ordinate sorting (that can result from either duplicates 
or un-sequenced regions). Hence, an additional 5% of 
indels are generated at the initial stage of the algorithm 
to account for the loss of indels at the duplicate removal 
step. In the next stage of the algorithm, the frequency of 
short, medium and large indels was factored in based on 
previous literature evidence for their distribution in hu- 
man [33]. The indel simulation produces two output 
files assuming the bi-allelic nature of human genes, each 
containing allele-specific coordinate information of sim- 
ulated indels. Among the total number of simulated 
indels, the algorithm simulates 30% single base indels, 
20% repeat expansions, 49% 2-20 bp indels, and 1% long 
indels including repeat expansions (see Additional file 2 
and Additional file 3). 

CNV simulation 

The CNV simulation constitutes the final step of the 
simulation algorithm, as it can ply in a sequential man- 
ner post indel simulation. Since the input files from 
indel simulation may contain heterozygous indels, which 
may be of unequal size, hence the CNV module takes it 
into consideration and prevents the possibility of boun- 
dary overlaps with indels. The flow chart in Figure 1C 
depicts the algorithm for simulating CNVs. 

Unlike the indel module, here the size and location of 
the variants are both generated dynamically with the 
flow of the program after obtaining the feed from the 
user to determine the number of CNVs and their range 
of size distribution (upper and lower limit). Such simu- 
lated data is particularly useful to test the accuracy and 
sensitivity of a new or existing CNV caller across a wide 
range of CNV sizes. The algorithm then filters the simu- 
lated CNVs based on its coordinates. First the span of 
each of the CNVs are evaluated to ensure correspon- 
dence with chromosomal boundary in either allele and 
subsequently the CNV boundaries are checked for over- 
lap with neighboring CNVs. The CNV is logged and the 
next iteration of location and size are generated upon 
meeting the aforementioned conditions. Unlike the SNV 
and indel simulation modules, the annotation data for 
both alleles in the CNV module is stored in the same file 
in the form of a tabular data. The tool also outputs a 



simplified results file (similar to a BED file), which can 
be read easily by any program for visualizing CNVs. 

Read generation 

SInC has a read generator part that generates short 
reads using a multi-threaded approach utilizing the par- 
allel processing power of commonly used quad-core 
desktop/laptop architecture. The process of read gene- 
ration uses a quality profile distribution-based error- 
model. We have used publicly available 100 bp read pair 
data derived by using Alumina instrument from the SRA 
database to generate the quality profile distribution and 
assessed the quality distribution for both forward and re- 
verse reads of the training set and data post read gener- 
ation (Additional file 4). For customization purpose, the 
tool is provided as a standalone tool in the SourceForge 
package so that the user can generate independent error 
profiles for reads with certain read lengths to be used 
during the read generation stage. The read generation al- 
gorithm follows a "divide and conquer" approach where 
each thread spans the input sequence once and the num- 
ber of reads required to obtain the user defined coverage 
are pooled from the estimated number of threads. User- 
specified cores utilization is implemented in the SInC read 
generator to prevent over-utilization of available CPU 
resources. The other major user defined parameters, in- 
clude read length, error profile, insert size (inner distance) 
and standard deviation of insert size (see Additional 
file 2). The algorithm initially creates one thread, which 
generates reads for the input fasta file. Depending on the 
read pairs generated in the first run, the numbers of 
threads required to obtain the desired coverage are calcu- 
lated and then executed in an iterative manner based on 
the number of cores specified by the user (Figure ID). 
SInC is optimized to run with 4 threads suiting a quad- 
core processor. 

Evaluation of SInC 
Variant re-discovery 

We used human chromosome 22 sequence from the 
UCSC build hgl9 for generating SNVs and indels using 
all the four different SRG simulators. The SNV rate, 
indel percentage and coverage was maintained across all 
the tools and the resulting reads were aligned using 
Novoalign [14]. These mapped files were subject to SNV 
and indel detection by GATK [4] and Pindel [6] respec- 
tively (see Results, Figure 2). The predicted SNVs and 
indels from the different simulators were compared to 
the actual number of incorporated variants to estimate 
the percentage rediscovery. Rediscovery percentage using 
Pindel has a limitation that it merges short indels within a 
span of 40 nucleotides of each other leading to a slight 
loss (less than 1%) of rediscovered indels across all the 
simulators (see Additional file 2). 
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HET SNP rediscovery C HOM SNP rediscovery D 




Figure 2 Variant rediscovery statistics. Percentages of simulated variants performed using GATK and PINDEL for identification are shown of 
A) SNVs and B) indels respectively. The rediscovery of indels based on size specificity was also performed and is given in Additional file 3. The 
rediscovery percentages of C) heterozygous and D) homozygous SNVs are compared. 



Time profiling 

Given the high-throughput nature of NGS data, genera- 
ting the bulk of simulated data still remains a time con- 
suming process. Hence, we have implemented a "divide 
and conquer" approach to the read generation module 
to reduce the time footprint in generating high coverage 
data. This property allows user to simulate data at a high 
coverage (50X - 100X) without inordinate expense of 
time. SInC can utilize 1 to 4 threads for optimal func- 
tion. Our comparison was set up based on default use of 
1 core ranging upto a maximum utilization of 4 cores in 
SInC versus the other tools (see Results, Figure 3). De- 
tails are provided in the Additional file 2. 

Transition/Transversion (Ti/Tv) ratio 

A transition mutation involves a change from purine to 
purine or pyrimidine to pyrimidine and a tranversion 
mutation involves a change from pyrimidine to purine 
or vice versa. This makes a transversion event twice as 
favourable as a transition event for any random muta- 
tion event. Hence, the Ti/Tv ratio for a random variation 



resulting from systematic errors in the sequencing tech- 
nology, alignment artifacts and data processing failures 
should be close to 0.5. As published earlier, Ti/Tv ratio 
for whole genome falls between 2.05 - 2.15 for both 
known and unknown SNPs. SInC incorporates a user- 
defined Ti/Tv ratio for simulation of SNVs. 

All the scripts to simulate variants and generate reads 
used default parameters and details of the scripts used 
are given in the Additional file 5. 

Results and discussion 

We have developed a simulator for all commonly occur- 
ring biological variants in the genome along with a read 
generator. We compared the latest pick of simulators 
with SInC simulator and read generator. In our model 
for SNV simulation, we have limited the range of simula- 
tion of SNVs using a distribution of distance between 
two consecutive SNVs. Based on SNV frequency studies 
in human genome [31,34], under default simulation pa- 
rameters the mean distance between two consecutive 
SNVs, DAvg, is set dynamically between 300 to 1000 



Pattnaik et al. BMC Bioinformatics 2014, 15:40 
http://www.biomedcentral.eom/1 471 -21 05/1 5/40 



Page 6 of 9 




bases depending upon user defined input for SNV rate. 
In indel simulations, the complexity of simulation de- 
pends on the frequency of indels in the simulated data. 
In the default mode for indel simulation, the algorithm 
is sensitive to the natural frequencies and size ranges as 
evidenced from existing literature [33,35]. The model 
for CNV simulation is an extension of the indel simula- 
tion, wherein the CNVs are dynamically generated while 
maintaining the allele specificity and genomic positions 
of indels simulated in the prior step. The simulated vari- 
ants are captured in log files, combined with input allele 
fasta files and processed by a multi-threaded process to 
enable fast-paced read generation. 

In order to assess the number of variants post simula- 
tion and read-generation in comparison to the number 
of variants that a sensitive variant caller like GATK [4] 
identifies, we used variant re-discovery rate as one of the 
parameters of evaluation. Variant re-discovery, although 
not linked with the efficiency of the simulator, can be 
used as one of the parameters to judge the combined 
efficiency of the simulator + read-generator + variant cal- 
ling process. The process of re-discovery can be im- 
pacted by the read-generation step, which incorporates 
different error models and/or base quality values. Given 
the fact that we used the same tool (GATK [4] for SNPs 
and Pindel [6] for indels) across all the simulators + 
read-generator pairs, discrepancies in the number of var- 
iants rediscovered provided us with indirect evidence on 
the individual tools' efficiency. The most likely expla- 
nation for varying re-discovery rate could be due to dif- 
ferent methodology adopted during the process of read 
generation where different error models (and quality 
values) are taken into consideration. Additionally, vari- 
ant re-discovery rate is a widely used parameter to assess 
the quality of variant calling and analysis, including in 
the 1000 genome project. In order to delineate coverage 
from that of combined simulation + read-generation + 



variant calling and re-discovery, we simulated reads at 
10X, 20X or 30X coverage and found that the coverage 
does not affect the re-discovery rate of SNPs using SInC 
(Additional file 6). The SNV rediscovery percentage sug- 
gested that SInC was at par with pIRS in the efficiency 
of simulating SNVs and comprehensively outperformed 
both GEMsim and dwgSIM (Figure 2A), suggesting the 
role of similar error-profile based model during the read 
generation process. Although other tools like dwgSIM 
and GEMsim are close to SInC in homozygous redis- 
covery rate (Figure 2D), SInC outperforms both these 
tools for heterozygous rediscovery rate (Figure 2C) sug- 
gesting the importance in simulating both homozygous 
and heterozygous real variants. In the rediscovery of in- 
dels, SInC emerges as the only simulator with the high- 
est percentage of total rediscovered indels, ahead at least 
by 15% from the closest contender pIRS (Figure 2B). We 
further tested the accuracy of the rediscovered indels by 
adding a size-based constraint and estimated the per- 
centage rediscovered in the size ranges containing 1 to 
6, 7 to 10, 11 to 20 and 21 to 100 nucleotide long indels. 
These size ranges were simulated due to their overall 
high (greater than 95%) natural prevalence in human ge- 
nomes [35]. This exercise corroborated the superiority 
of SInC in detecting indels while retaining the size-based 
constraints implicated in the simulation algorithm in 
comparison to the other tools tested (Figure 2B). The 
numbers of SNVs and indels rediscovered by SInC are 
especially important because the total number of SNVs 
simulated by SInC is about 10-20% more than the other 
tools tested and 20-40% higher for indels. Another sig- 
nificant advantage of SInC is apparent from the redis- 
covered heterozygous SNVs. As depicted in Figure 2C, 
the difference in homozygous SNV rediscovery is rather 
conserved across the simulators compared to Figure 2D, 
which gives SInC an edge in conservation of zygosity of 
the calls post read generation. Notably, pIRS although 
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uses a similar error-profile as SInC, it does not catalog 
the simulated SNVs to facilitate rediscovery of heterozy- 
gous and homozygous SNVs separately. The CNV module 
of SInC simulator was used in a previous study to test a 
CNV prediction tool, COPS [17], and was used to com- 
pare its accuracy and sensitivity to other popular CNV 
prediction tools. We were unable to perform a compara- 
tive analysis of the CNV module in SInC due to the un- 
availability of any published tools that can simulate CNVs. 
However, as previously shown [17], the percentage redis- 
covery using multiple CNV discovery tools like CNAseg 
CNV-seq, CNVnator and SVDetect yielded >90% CNVs. 

Next, we wanted to test the speed of SInC read gener- 
ator. Figure 3 depicts the advantages that SInC provides 
during read generation due to implementation of a "div- 
ide and conquer" approach by efficient utilization of C 
thread functions. The tool was tested for its processing 
capability under a range of multi-threaded options ran- 
ging from default utilization of 1 core to a maximum 
utilization of 4 cores. SInC accomplished read gener- 
ation at least one and a half times faster than pIRS and 
three times faster than ART; the two most recent Illu- 
mina read simulators (see Additional file 2). The time 
profile demonstrated substantial reduction in time foot- 
print using SInC in comparison to the other tools sam- 
pled in our study. This difference in generation time of 
simulated data is reflected clearly in generating high co- 
verage datasets from large genomes, human genome in 
our case as shown in Figures 3B and C. 

Although there are a multitude of popular tools capable 
of predicting genomic variations using high-throughput 
sequence data, the generality of such tools are question- 
able. In many ways, a simulated dataset is crucial towards 
determining the success of predictive algorithms in the 
context of real dataset. Simulators that can simulate vari- 
ants and generate reads are valuable tools used for devel- 
oping and testing tools for sequence data analysis. An 
ideal tool that can both simulate multiple variant types 
(SNVs, indels and CNVs) and generate sequencing reads 
taking into account a realistic platform-specific quality- 
profile of an sequencing instrument is currently lacking. 
We tried filling this void by designing a versatile and fast 
tool that can generate multiple types of biological variants 
(SNVs, indels and CNVs) and can run on a minimalistic 
quad core desktop computer using multi-threaded option. 
The time advantage obtained in SInC could be attrib- 
uted to the optimized algorithms and efficient use of 
C thread functions to manage the I/O streams. This 
advantage is also obvious in a single core, which de- 
legates the bulk of the data generation to multiple 
threads to ensure efficient use of memory in line with 
"divide and conquer" approach. The optimization of mul- 
tiple core usage is available upto 4 cores in quad-core 
architecture. 



Another major functional advantage of this tool is 
its ability to simulate CNVs. CNVs have been shown to 
contribute more towards genetic diversity than SNVs 
and are conspicuous by their pervasiveness in human 
genome [36-39]. The advent of NGS platforms has geared 
multiple efforts to build frameworks towards identifying 
CNVs and assess their penetrance in disease etiology. 
However, most of these efforts are only partially ef- 
fective in capturing population-based generalizations. 
In order to build a robust and generic framework, it 
is imperative to build exhaustive datasets with the known 
signatures and explore the range of false discovery rates 
inherent to the tools and subsequently improve them. 
The ability to create such datasets will definitely im- 
prove the approach and accuracy of predictions made 
by existing tools. Hence, a flexible, user-input based 
simulator has substantial application in building useful 
datasets allowing for improvement of current approaches 
towards variant discovery as a whole. Although there have 
been efforts in the past to discovering CNVs using NGS 
data, currently there are no available simulators to fine- 
tune CNV detection algorithms. SInC simulator not only 
fulfills the simulation of CNVs but an additional function- 
ality of SInC simulator is to generate allele-specific CNVs. 
This is particularly useful if one has to understand the 
copy number changes at an allelic level important for 
many diseases [40,41]. 

Production of large amount of heterogeneous data in 
high-throughput biology requires sophisticated compu- 
tational tools for efficient analysis, storage, sharing and 
archiving. This requires resources, both software and 
hardware, and interoperability of computational re- 
sources. A common practice among computational bio- 
logist is to use simulated data to test the efficacy of 
the tools before applying them to real dataset. Al- 
though there are many simulators available currently, 
there is none that suits the need of every computa- 
tional biologist wanting to make tools for short-read 
sequence data. Keeping this in mind, we have deve- 
loped a tool to help computational biologists create 
simulated datasets using only one simulator that can 
span across sequencing platforms and variant types 
(SNVs, indels and CNVs). Although, SInC simulator 
was tested with human genome, it is versatile to ad- 
dress the complexity of any genome, its substitution 
rate, variant frequency and transition to transversion 
ratio. Large genomes, like that from many plants, 
need time to generate simulated reads at high cover- 
age and this is where the multi-threaded capability of 
SInC scores high in comparison to other tools. By 
using a standalone quality-score distribution model of 
real dataset, SInC provides an opportunity to individ- 
ual user to generate reads at different read lengths but 
with realistic quality. 
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Conclusions 

We report a tool called SInC that can simulate and gen- 
erate short sequence reads with different types of bio- 
logical variants. The ability of SInC to generate realistic 
fastq reads based on Illumina read quality profiles along 
with its capacity to simulate multiple biological variants 
and generate reads concurrently makes it a powerful op- 
tion in a variety of simulation studies and a part of com- 
putational biologists' essential toolkit. 

Availability and requirements 

Project name: sincsimulator 

Project home page: http://sourceforge.net/projects/ 
sincsimulator 

Operating system(s): Linux 
Programming Language: C 

Other Requirements: GNU Scientific Library(gsl library), 
pthreads library 

License: Creative Commons Attribution Non-Commercial 
License V2.0 

Any restrictions to use by non-academics: License 
needed 

Additional files 
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SNP: Single nucleotide polymorphism; Indel: Insertions and deletions; 
CNV: Copy number variations. 
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